After tidying our data set, we conduct an exploratory analysis to look for possible predictors of the Attendance outcome.

Descriptive Analysis for All Data:

Total Attendance by Type from 2019 to 2022

The table below summarizes total attendance for each Type of facility by year, in units of 100,000 visitors:

theme_park |> 
  group_by(Year, Type) |> 
  # total attendance per year and type, rescaled to units of 100,000 visitors
  summarise(sum = sum(Attendance) / 100000, .groups = "drop") |> 
  arrange(Type) |> 
  # one column per facility type
  pivot_wider(
    names_from = Type,
    values_from = sum
  ) |> 
  knitr::kable(digits = 3, caption = "Summary of Attendance for Three Types of Facilities From 2019 to 2022")
Summary of Attendance for Three Types of Facilities From 2019 to 2022

Year | Amusement/Theme Park | Museum  | Water Park
2019 | 37996.4              | 20100.8 | 5898.9
2020 | 13031.1              | 4664.5  | 2313.5
2021 | 22463.7              | 6459.0  | 3473.5
2022 | 21280.8              | 11603.3 | 4678.3

From this table, some observed patterns are:

  • Attendance decreased for all three types from 2019 to 2020
  • Attendance for all three types was higher in 2022 than in 2020, although Amusement/Theme Park attendance dipped slightly from 2021 to 2022
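
To put numbers on these patterns, the year-over-year percent change can be computed directly from the totals in the table. The following is a minimal base-R sketch; the helper name pct_change and the shortened column names are illustrative assumptions and do not appear in the original analysis.

```r
# Totals copied from the summary table above (units of 100,000 visitors).
totals <- data.frame(
  Year       = 2019:2022,
  Amusement  = c(37996.4, 13031.1, 22463.7, 21280.8),
  Museum     = c(20100.8, 4664.5, 6459.0, 11603.3),
  Water_Park = c(5898.9, 2313.5, 3473.5, 4678.3)
)

# Percent change of each value relative to the previous one (hypothetical helper).
pct_change <- function(x) 100 * (x[-1] / x[-length(x)] - 1)

# One row per pair of consecutive years, one column per facility type.
changes <- data.frame(
  Period = paste(totals$Year[-nrow(totals)], totals$Year[-1], sep = "-"),
  lapply(totals[-1], pct_change)
)
print(changes, digits = 3)
```

A negative entry marks a decline relative to the previous year, so each row shows which facility types moved together in that period.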

The distribution of attendance by year is further visualized in the box plots below:

theme_park |> 
  group_by(Year) |> 
  # factor(Year) gives each year its own box and a discrete color
  plot_ly(y = ~Attendance, color = ~factor(Year), type = "box", colors = "viridis") |> 
  layout(
    annotations = list(
      x = 1, y = 1, text = "Plot 1: Distribution of Attendance by Year",
      showarrow = FALSE, xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "auto", xshift = 0, yshift = 0,
      font = list(size = 15)
    )
  )
  • Attendance was highest overall in 2019 and lowest in 2020
  • Outliers appear in the attendance distribution for every year
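
To count the outliers rather than read them off the plot, the usual box-plot rule (points more than 1.5 times the interquartile range beyond the quartiles) can be applied by hand. The sketch below is an assumption-laden illustration: flag_outliers is a hypothetical helper, the attendance values are made up, and plotly may compute quartiles slightly differently from R's default quantile(), so points close to a fence could be classified differently.

```r
# Flag values outside the 1.5 * IQR fences (hypothetical helper).
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up attendance values for a single year (not taken from the data).
toy_attendance <- c(1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 9.8)

# Values that would be drawn as individual points beyond the whiskers.
outliers <- toy_attendance[flag_outliers(toy_attendance)]
outliers
```

Applied to the real data, the same helper could be run on the Attendance values of each year separately.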

Attendance by Region from 2019 to 2022

Next, we look at how Attendance changed from 2019 to 2022 within each Region. Note that the code below averages attendance across the parks in a region rather than summing it.

theme_park |> 
  group_by(Region, Year) |> 
  # mean attendance per park for each region and year
  summarize(attend_mean = mean(Attendance), .groups = "drop") |> 
  plot_ly(x = ~Year, y = ~attend_mean, color = ~Region,
          type = "scatter", mode = "lines+markers", colors = "viridis") |> 
  layout(
    annotations = list(
      x = 1, y = 1, text = "Plot 2: Change in Attendance for Each Region",
      showarrow = FALSE, xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "auto", xshift = 0, yshift = 0,
      font = list(size = 15)
    )
  )
  • Similar attendance fluctuation for most regions
  • Only EMEA has an attendance drop from 2021 to 2022

Descriptive Analysis for the Worldwide Top 25

theme_full = 
  read_csv("ultimate data.csv")
## Rows: 920 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Park_Name, City, Country, Type, Region
## dbl (2): Year, Attendance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Total Attendance of the Worldwide Top 25 Theme Parks by Country

Plot 1 showed a large number of outliers, so we ran a separate analysis on the top 25 theme parks, which are labeled Worldwide in the Region variable.

We use the Worldwide rows to rank the top 25 theme parks by total number of visitors in each country, and read our findings off the resulting chart.
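
The ranking step can be sketched in base R as follows. This is a minimal sketch, not the authors' code: it uses a toy stand-in for theme_full with made-up park names, countries, and attendance figures, and only the column names (Park_Name, Country, Region, Attendance) come from the column specification shown earlier.

```r
# Toy stand-in for theme_full; all values below are hypothetical.
toy_full <- data.frame(
  Park_Name  = c("Park A", "Park B", "Park C", "Park D", "Park E"),
  Country    = c("X", "X", "Y", "Z", "Y"),
  Region     = c("Worldwide", "Worldwide", "Worldwide", "Worldwide", "EMEA"),
  Attendance = c(10, 5, 12, 3, 99)
)

# Keep only the rows labeled Worldwide (the top 25 list in the real data).
top_parks <- subset(toy_full, Region == "Worldwide")

# Sum attendance within each country.
by_country <- aggregate(Attendance ~ Country, data = top_parks, FUN = sum)

# Order countries from highest to lowest total attendance.
by_country <- by_country[order(by_country$Attendance, decreasing = TRUE), ]
by_country
```

With the real data, the same three steps (filter, aggregate, sort) would produce the country totals behind a ranking chart.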

ANOVA test

Based on Type of Facilities

The first ANOVA test focuses on the Type variable in our data set. The null and alternative hypotheses are as follows:

\[H_0: \mu_{\text{Amusement/Theme Park}} = \mu_{\text{Water Park}} = \mu_{\text{Museum}} ~~ \text{vs} ~~ H_1: \text{at least two means are not equal}\]

anova_1 = aov(Attendance ~ Type, data = theme_park)

summary(anova_1)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Type          2 1.127e+17 5.635e+16   105.3 <2e-16 ***
## Residuals   737 3.944e+17 5.351e+14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value is less than 2e-16, we reject the null hypothesis. We have evidence that mean attendance differs between at least two of the three facility types in the Type variable.
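
As a quick consistency check, the F statistic and its p-value can be recomputed by hand from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below only uses the rounded figures from that output, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the summary(anova_1) output above.
ss_type  <- 1.127e17
df_type  <- 2
ss_resid <- 3.944e17
df_resid <- 737

ms_type  <- ss_type / df_type      # mean square for Type
ms_resid <- ss_resid / df_resid    # mean square for the residuals
f_value  <- ms_type / ms_resid     # F statistic = ratio of the two mean squares

# Upper-tail probability of an F distribution with (2, 737) degrees of freedom.
p_value <- pf(f_value, df_type, df_resid, lower.tail = FALSE)

c(F = f_value, p = p_value)
```

The recomputed F value agrees with the reported 105.3 up to rounding, and the p-value falls far below the 2e-16 threshold that R prints.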

Based on Year

The second ANOVA test focuses on the Year variable in our data set. The null and alternative hypotheses are as follows:

\[H_0: \mu_{\text{2019}} = \mu_{\text{2020}} = \mu_{\text{2021}}= \mu_{\text{2022}} ~~ \text{vs} ~~ H_1: \text{at least two means are not equal}\]

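dat = 
  theme_full |> 
  # drop the Worldwide (top 25) rows; compare against a single string
  filter(Region != "Worldwide") |> 
  # treat Year as a categorical grouping variable
  mutate(Year = as.factor(Year))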

anova2 = aov(Attendance ~ Year , data = dat) |> 
  summary()

With a p-value of less than 2e-16, we reject the null hypothesis. We have evidence that mean attendance differs between at least two of the four years.
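
The conversion Year = as.factor(Year) before the second ANOVA matters: with a numeric Year, aov() fits a single linear trend (1 degree of freedom) instead of comparing four group means (3 degrees of freedom). The sketch below illustrates the difference on simulated data; the attendance values are randomly generated for illustration and are unrelated to the real data set.

```r
set.seed(1)

# Simulated data: 10 hypothetical parks per year with different yearly means.
toy <- data.frame(
  Year       = rep(2019:2022, each = 10),
  Attendance = rnorm(40, mean = rep(c(50, 20, 30, 40), each = 10), sd = 5)
)

# Numeric Year: the model is a straight-line trend across years.
fit_numeric <- aov(Attendance ~ Year, data = toy)

# Factor Year: the model compares the four yearly group means.
fit_factor <- aov(Attendance ~ factor(Year), data = toy)

# Degrees of freedom for the Year term and the residuals in each model.
df_numeric <- summary(fit_numeric)[[1]][["Df"]]
df_factor  <- summary(fit_factor)[[1]][["Df"]]

df_numeric
df_factor
```

Only the factor version corresponds to the hypotheses stated above, which compare the four yearly means.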